CORE-Bench OOD


// recent coverage 1 mentions

04:00

2026-06-26

arxiv.org

artificial-intelligence

Life After Benchmark Saturation: A Case Study of CORE-Bench

An arXiv paper proposes a multi-dimensional framework for evaluating AI agents once accuracy on a benchmark has saturated, using CORE-Bench Hard as a case study. The authors introduce CORE-Bench v1.1 and an out-of-dist…

// co-occurs with top 4 entities

arXiv (1), CORE-Bench (1), CORE-Bench Hard (1), CORE-Bench v1.1 (1)